Training with More Confidence: Mitigating Injected and Natural Backdoors During Training
Backdoor (or Trojan) attacks are a severe threat to deep neural networks (DNNs). Researchers have found that DNNs trained on benign data in benign settings can also learn backdoor behaviors, a phenomenon known as the natural backdoor. Existing anti-backdoor learning methods rely on the weak observation that backdoor and benign behaviors can be differentiated during training; an adaptive attack with slow poisoning can bypass such defenses, and these methods cannot defend against natural backdoors. We identify a fundamental difference between backdoor-related neurons and benign neurons: backdoor-related neurons form a hyperplane as the classification surface across the input domains of all affected labels. By further analyzing the training process and model architectures, we find that piece-wise linear functions cause this hyperplane surface. In this paper, we design a novel training method that forces training to avoid generating such hyperplanes, thereby removing injected backdoors. Extensive experiments on five datasets against five state-of-the-art attacks, as well as on benign training, show that our method outperforms existing state-of-the-art defenses. On average, the attack success rate (ASR) of models trained with NONE is 54.83 times lower than that of undefended models under standard poisoning backdoor attacks and 1.75 times lower under natural backdoor attacks. Our code is available at
https://github.com/RU-System-Software-and-Security/NONE
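The abstract reports results in terms of ASR. As a point of reference, the following is a minimal sketch of how ASR is typically measured in this line of work; the model interface and the `stamp_trigger` helper (which applies the backdoor trigger to a batch of inputs) are hypothetical, and this is a generic illustration of the metric rather than NONE's evaluation code.

```python
import torch

def attack_success_rate(model, inputs, labels, stamp_trigger, target_label, device="cpu"):
    """Fraction of trigger-stamped, originally non-target inputs that the model
    classifies as the attacker's target label (the standard ASR metric)."""
    model.eval()
    keep = labels != target_label              # ASR is measured on non-target samples
    stamped = stamp_trigger(inputs[keep]).to(device)
    with torch.no_grad():
        preds = model(stamped).argmax(dim=1)
    return (preds == target_label).float().mean().item()
```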
NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models
Prompt-based learning is vulnerable to backdoor attacks. Existing backdoor attacks against prompt-based models inject backdoors into the entire embedding layers or into word embedding vectors. Such attacks can easily be weakened by retraining on downstream tasks or by using different prompting strategies, which limits the transferability of the backdoor. In this work, we propose NOTABLE, a transferable backdoor attack against prompt-based models that is independent of downstream tasks and prompting strategies. Specifically, NOTABLE injects backdoors into the encoders of pre-trained language models (PLMs) by utilizing an adaptive verbalizer to bind triggers to specific words (i.e., anchors). It activates the backdoor by stamping inputs with triggers so that they resolve to adversary-desired anchors, achieving independence from downstream tasks and prompting strategies. We conduct experiments on six NLP tasks, three popular models, and three prompting strategies. Empirical results show that NOTABLE achieves superior attack performance (i.e., an attack success rate above 90% on all datasets) and outperforms two state-of-the-art baselines. Evaluations against three defenses demonstrate the robustness of NOTABLE. Our code can be found at
https://github.com/RU-System-Software-and-Security/Notable
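For readers unfamiliar with the verbalizer mechanism this attack targets, below is a minimal, hypothetical sketch of prompt-based classification with a fixed verbalizer on a masked language model; NOTABLE's adaptive verbalizer binds trigger tokens to such anchor words so that trigger-stamped inputs resolve to the adversary-desired anchor. The model name, prompt template, and label-to-anchor mapping here are illustrative assumptions, not the paper's setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# verbalizer: map each class label to an anchor word predicted at the mask position
verbalizer = {"positive": " great", "negative": " terrible"}

def prompt_classify(text: str) -> str:
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    scores = {
        label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]].item()
        for label, word in verbalizer.items()
    }
    return max(scores, key=scores.get)

print(prompt_classify("The movie was a delight from start to finish."))
```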
Alteration-free and Model-agnostic Origin Attribution of Generated Images
Recently, image generation models have attracted growing attention. However, concerns have emerged regarding potential misuse and intellectual property (IP) infringement associated with these models. It is therefore necessary to analyze the origin of images by inferring whether a specific image was generated by a particular model, i.e., origin attribution. Existing methods are limited to specific types of generative models and require additional steps during training or generation, which restricts their use with pre-trained models that lack these operations and may compromise the quality of image generation. To overcome this problem, we develop an alteration-free and model-agnostic origin attribution method via input reverse-engineering on image generation models, i.e., inverting the input of a particular model for a specific image. Given a particular model, we first analyze how the hardness of the reverse-engineering task differs between images generated by that model and other images. Based on this analysis, we propose a method that uses the reconstruction loss of reverse-engineering to infer the origin. Our method effectively distinguishes images generated by a specific generative model from other images, including those generated by different models and real images.
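A minimal sketch of the reconstruction-loss idea, assuming a generator that maps a latent vector to an image; the latent dimension, optimizer settings, and the use of MSE as the reconstruction loss are illustrative assumptions. Thresholding the returned loss, with the threshold calibrated on images known to come from the model, would give the attribution decision.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(generator, image, latent_dim=512, steps=500, lr=0.05):
    """Reverse-engineer the generator's input for `image` by gradient descent on
    a latent vector, and return the best reconstruction loss found. A low loss
    suggests the image lies on (or near) the generator's output manifold."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    best = float("inf")
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(z), image)
        loss.backward()
        opt.step()
        best = min(best, loss.item())
    return best

# attribution decision: flag `image` as generated by `generator` when the loss
# falls below a threshold calibrated on the model's own generations
```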
Rethinking the Reverse-engineering of Trojan Triggers
Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks. Reverse-engineering methods can reconstruct the trigger and thus identify affected models. Existing reverse-engineering methods only consider input-space constraints, e.g., trigger size in the input space. Specifically, they assume the triggers are static patterns in the input space and fail to detect models with feature-space triggers such as image style transformations. We observe that both input-space and feature-space Trojans are associated with feature-space hyperplanes. Based on this observation, we design a novel reverse-engineering method that exploits the feature-space constraint to reverse-engineer Trojan triggers. Results on four datasets and seven different attacks demonstrate that our solution effectively defends against both input-space and feature-space Trojans. It outperforms state-of-the-art reverse-engineering methods and other types of defenses in both Trojaned-model detection and mitigation tasks. On average, the detection accuracy of our method is 93%. For Trojan mitigation, our method reduces the ASR (attack success rate) to only 0.26% with the BA (benign accuracy) remaining nearly unchanged. Our code can be found at
https://github.com/RU-System-Software-and-Security/FeatureRE
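For context, the following is a minimal sketch of classic input-space trigger reverse-engineering (a Neural-Cleanse-style optimization of a mask and pattern); FeatureRE's contribution is to additionally constrain the stamped inputs' internal features, which is not reproduced here. The model interface, stamping formulation, and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, images, target_label, steps=300, lr=0.1, lam=1e-3):
    """Optimize a mask and pattern so that stamped images flip to `target_label`
    while the mask stays small (input-space constraint only)."""
    mask = torch.zeros_like(images[:1]).requires_grad_(True)
    pattern = torch.zeros_like(images[:1]).requires_grad_(True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    target = torch.full((images.size(0),), target_label, dtype=torch.long)
    for _ in range(steps):
        m = torch.sigmoid(mask)                                  # soft mask in [0, 1]
        stamped = (1 - m) * images + m * torch.tanh(pattern)     # apply candidate trigger
        loss = F.cross_entropy(model(stamped), target) + lam * m.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask).detach(), torch.tanh(pattern).detach()
```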
How to Detect Unauthorized Data Usages in Text-to-image Diffusion Models
Recent text-to-image diffusion models have shown impressive performance in generating high-quality images. However, concerns have arisen regarding the unauthorized usage of data during the training process. One example is a model trainer who collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission from the artist. To address this issue, it becomes crucial to detect unauthorized data usage. In this paper, we propose a method for detecting such unauthorized data usage by planting injected memorization into text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected image dataset by adding unique content to the images, such as stealthy image warping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether a model has memorized the injected content (i.e., whether its generated images are processed by the chosen post-processing function), we can detect models that illegally used the unauthorized data. Our experiments on Stable Diffusion and LoRA models demonstrate the effectiveness of the proposed method in detecting unauthorized data usage.
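As an illustration of what such an imperceptible coating could look like, here is a hypothetical sketch that warps images with a small sinusoidal displacement field; the specific displacement function, strength, and frequency are assumptions for illustration, not the paper's exact coating.

```python
import torch
import torch.nn.functional as F

def stealthy_warp(images, strength=0.004, freq=4):
    """Apply a subtle warping to a batch of images (B, C, H, W) that is hard to
    see but can be picked up by a model trained on the coated data."""
    b, _, h, w = images.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    dx = strength * torch.sin(freq * torch.pi * ys)    # small horizontal displacement
    dy = strength * torch.sin(freq * torch.pi * xs)    # small vertical displacement
    grid = torch.stack((xs + dx, ys + dy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(images, grid, align_corners=True)
```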
FairNeuron: Improving Deep Neural Network Fairness with Adversary Games on Selective Neurons
With Deep Neural Networks (DNNs) being integrated into a growing number of critical systems with far-reaching societal impact, there are increasing concerns about their ethical performance, such as fairness. Unfortunately, model fairness and accuracy are in many cases contradictory optimization goals. To address this issue, a number of works try to improve model fairness through an adversarial game at the model level: an adversary evaluates the fairness of the model in addition to its prediction accuracy on the main task, and joint optimization is performed to achieve a balanced result. In this paper, we observe that in backpropagation-based training, this contradiction also appears at the level of individual neurons. Based on this observation, we propose FairNeuron, an automatic DNN model repair tool that mitigates fairness concerns and balances the accuracy-fairness trade-off without introducing another model. It detects neurons whose optimization directions from the accuracy and fairness training goals contradict each other and achieves a trade-off through selective dropout. Compared with state-of-the-art methods, our approach is lightweight, making it more scalable and efficient. Our evaluation on three datasets shows that FairNeuron can effectively improve the fairness of all models while maintaining stable utility.
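A minimal sketch of how neurons with contradictory optimization directions might be located, assuming a fully connected layer and differentiable accuracy and fairness losses built from the same forward pass; the per-neuron cosine criterion and the selective-dropout follow-up are illustrative choices, not FairNeuron's exact algorithm.

```python
import torch
import torch.nn.functional as F

def conflicting_neurons(layer_weight, acc_loss, fair_loss):
    """Return indices of output neurons whose gradient from the accuracy objective
    opposes their gradient from the fairness objective. `layer_weight` is the
    (out_features, in_features) weight of a Linear layer."""
    g_acc = torch.autograd.grad(acc_loss, layer_weight, retain_graph=True)[0]
    g_fair = torch.autograd.grad(fair_loss, layer_weight, retain_graph=True)[0]
    cos = F.cosine_similarity(g_acc, g_fair, dim=1)    # one value per output neuron
    return (cos < 0).nonzero(as_tuple=True)[0]

# A trade-off could then be reached by zeroing (dropping) these neurons'
# activations for part of the training -- the selective-dropout idea.
```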
CILIATE: Towards Fairer Class-based Incremental Learning by Dataset and Training Refinement
Due to the model aging problem, Deep Neural Networks (DNNs) need updates to adapt to new data distributions. The common practice leverages incremental learning (IL), e.g., Class-based Incremental Learning (CIL), which updates output labels, to update the model with new data and a limited amount of old data. This avoids heavyweight training from scratch with conventional methods and saves storage space by reducing the amount of old data to store, but it also leads to poor fairness. In this paper, we show that CIL suffers from both dataset and algorithm bias, and that existing solutions can only partially address the problem. We propose a novel framework, CILIATE, that fixes both dataset and algorithm bias in CIL. It features a novel differential-analysis-guided dataset and training refinement process that identifies unique and important samples overlooked by existing CIL and enforces the model to learn from them. Through this process, CILIATE improves the fairness of CIL by 17.03%, 22.46%, and 31.79% compared to the state-of-the-art methods iCaRL, BiC, and WA, respectively, based on our evaluation on three popular datasets and widely used ResNet models.
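As a rough illustration of a differential analysis between the old model and the incrementally updated one, the hypothetical sketch below flags samples the old model classified correctly but the updated model no longer does; CILIATE's actual criterion and refinement procedure are more involved, so the correctness-flip rule and the non-shuffled loader are assumptions.

```python
import torch

def overlooked_samples(old_model, new_model, loader, device="cpu"):
    """Return indices of samples the old model got right but the incrementally
    updated model now gets wrong. Assumes `loader` iterates in a fixed order
    (shuffle=False) so indices are stable."""
    old_model.eval()
    new_model.eval()
    flagged, offset = [], 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            old_ok = old_model(x).argmax(dim=1) == y
            new_ok = new_model(x).argmax(dim=1) == y
            flips = torch.nonzero(old_ok & ~new_ok, as_tuple=True)[0]
            flagged.extend((offset + flips).tolist())
            offset += y.size(0)
    return flagged

# The refined training pass could then oversample or up-weight these samples.
```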
- …